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Abstract. We show that modularity, a quantity introduced in the study of networked 
systems, can be generalized and used in the clustering problem as an indicator for the 
quality of the solution. The introduction of this measure arises very naturally in 
the case of clustering algorithms that are rooted in Statistical Mechanics and use the 
analogy with a physical system. 
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1. Introduction 

The problem of data clustering consists of grouping together items so that two points 
belonging to the same group (cluster) are, in some sense, more similar than two that 
belong to different ones; it has applications in several fields such as pattern recognition, 
bioinformatics, learning, astrophysics and more (for a review see e.g. [1]). A data set 
composed by iV points is represented either in terms of their coordinates (features) in 
a D-dimensional space or, alternatively, by means of an TV x N "distance matrix", 
whose elements measure the dissimilarity of pairs of data points. Many clustering 
methods are now available. Some of them make some assumption about the clusters' 
density distribution, so that so that clustering becomes an optimization problem. Other 
approaches are called nonparametric as they do not impose a determinate model to the 
data set. 

The clustering problem is inherently ill-posed, i.e. any data set can be clustered in 
drastically different ways, with no clear criterion to prefer one clustering over another. 
In particular in the case of unsupervised approaches, a satisfactory clustering of data 
depends on the desired resolution which determines the number of clusters and their 
size. However, the goal of data clustering is to partition a sample according to natural 
classes that are present in it. For example in the case of gene clustering in microarray 
data one search to separate genes that are involved in a particular biological processes 
from others which act as spectators. In the features space the job consists of recognizing 
regions that are densely populated and are separated by regions in which the density is 
lower. 

The problem of natural clustering becomes very important when analyzing the 
results of clustering procedures. For example, if one applies different algorithms to the 
same data set what is the criterion to choose among them the most efficient one? Let 
us consider yet another situation: many algorithms perform hierarchical clustering, i.e. 
there is a parameter which controls the resolution at which the data set is clustered; by 
varying the value of this parameter the data set is grouped in a hierarchy of clusters 
ranging from the whole data set (one cluster) to single items (N clusters). Also in this 
situation the following question arises: what is the best value of this parameter or, in 
other words, at what resolution should I look at the data to find a scientific meaning in 
the classification? 

The problem of the most natural clustering solution has been explored in a recent 
paper [2], where an answer is given in the frame of the Superparamagnetic clustering 
(SC) method [H 0]. The authors introduce an automated sequential procedure, 
Sequential Superparamagnetic Clustering (SSC), in which SC is recursively applied: 
at each step one well distinguished cluster is separated and in the next step SC is 
applied to the rest of the sample. In this article we present an alternative procedure 
that emerges naturally in clustering algorithms that are inspired by the analogy with a 
physical system, but can be applied in a straightforward way also to other algorithms 
that perform hierarchical clustering. The central role in this approach is played by 
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Modularity, a quantity which was first introduced in order to detect community structure 
in networks; it is the object of next section. Then we briefly review one of the clustering 
methods that are based on a Theoretical Physics approach and we show that Modularity, 
if redefined starting from the coupling matrix rather than the adjacency matrix, is able 
to select the most efficient cluster partition for a data set whose correct classification is 
already known. Last section is dedicated to comments and conclusions. 

2. Networks and modularity 

A network is the representation of a data system as a set of nodes joined by edges 
that correspond to pairwise relations between nodes. Networks can be used to describe 
many systems of scientific interest, such as electric power grids, road or airline networks, 
the Internet and World Wide Web, social communities, biological or chemical systems. 
The interest in the study of networks increased in the last decade together with the 
availability of large databases, describing real world networks, and large computing 
power. Recently many authors reviewed concepts and results in this discipline [H1E1CF], 
and we refer to them for further details. 

A typical feature of real networks is the appearance of tightly connected subgraphs 
with only sparser links between them. For example, in the case of social networks the 
existence of these subgroups could be related with the presence of important cultural 
differences between individuals. Many efforts have been dedicated to the problem of 
identifying communities in a network. One of the most effective method was introduced 
by Girvan and Newman [8j [9]: their algorithm is able to produce a hierarchy of 
subdivisions in a network, from a single agglomerate to isolated nodes. They introduced 
also a quantity, the modularity Q, able to select which of the divisions is optimal. It is 
clear that, if a partition in a fixed number of subgroups of the network is requested, the 
best solution is obtained minimizing the number of edges connecting vertices belonging 
to different subgroups (or maximizing the number of vertices belonging to the same 
subgroup). But this is not a good recipe when the number of divisions is free, because 
this solution would correspond to no division at all. In this case a good division into 
communities is the one in which the number of edges between vertices belonging to the 
same group is significantly greater than the number expected from a random distribution 
of edges. This crucial concept is turned in mathematical quantity, the modularity. 

Let us consider a network composed by N vertices and an adjacency matrix A, 
where is the number of edges (0 or 1) between vertices i and j. The number of 
edges incident with node i is ki = J2j Aij an d the total number of edges is m = \ J2i ^i- 
Suppose our network has been partitioned in q groups; we can label each node % with 
an index crj e {1, 2, . . . , q} denoting its group. Modularity is defined by 



The first term in the sum contributes to the number of edges falling within groups while 
the second term is the expected number for the same quantity in case of a random 




(1) 
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distribution of edges. Thus the problem of detecting communities in a network is 
turned into the problem of maximizing the modularity. Many standard and dedicated 
optimization techniques have been investigated; see [10] and references therein. 

3. Modularity and clustering algorithms 

Extending the use of modularity to the problem of data clustering, in particular when 
hierarchical algorithms are used, requires the introduction of edges in the feature space 
where data are represented. This comes out in a very natural way in the case of some 
algorithms that are rooted in Theoretical Physics and that work through the analogy 
with a physical system. These algorithms associate a physical quantity to each point 
in the features space, the entities beeing coupled by a coupling constant that depends 
inversely by the distance. This interaction is able to drive the system to an equilibrium 
condition where entities that are closer are characterized by a more similar behavior. 
Clusters are then recognized by the measure of some variable identifying this dynamical 
similarity. This is the case of an algorithm called Chaotic Map Clustering (CMC) 
algorithm [11] to which some of the author of this paper contributed. We briefly review 
the CMC algorithm. 

We assign a real dynamical variable Xj G [1, 1] to each point of the data set and 
define pair-interactions 

J ^ =exp {"^w}' (2) 

where is a suitable measure of distance between points i and j in our D-dimensional 
space, (dij) 2 is its average over the sample and a is a parameter. The time evolution of 
the system is given by: 

Xi(t+l) = ±-Y, J *f( x j®)> ( 3 ) 

where Cj = Ylij^i Jiji anc ^ we choose the map f(x) = 1 — 2x 2 . Due to the choice of the 
function /, equations (j3J) represent the dynamical evolution of chaotic maps X{ coupled 
through pair interactions Jy. The lattice architecture is fully specified by fixing the value 
of a as the average distance of fc-nearest neighbors pairs of points in the whole system. 
For the sake of computational economy, we consider only interactions of each map with 
a limited number of maps whose distance is less than 3a, and set all other to zero. 
Starting from a random initial configuration of x, equations ^3J) are iterated until the 
system attains its stationary regime, corresponding to a macroscopic attractor which is 
independent of the initial conditions. To study the correlation properties of the system, 
we consider the mutual information Z^-, between pairs of variables whose definition [12] 
is as follows. If the state of element i is Xi(t) > then it is assigned a value 1, otherwise 
it is assigned 0: this generates a sequence of bits, in a certain time interval, which allows 
the calculation of the Shannon entropy Hi for the zth map. In a similar way the joint 
entropy is calculated for each pair of maps and finally the mutual information is 
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given by = Hi + Hj — Hij. The mutual information is a good measure of correlations 
and it is practically precision independent, due to the coarse graining of the dynamics. If 
maps i and j evolve independently then = 0; if the two maps are exactly synchronized 
then the mutual information achieves its maximum value, In 2. The algorithm identifies 
clusters with the linked components of the graph obtained by drawing a link between 
all the pairs of maps whose mutual information exceeds a threshold 9: 9 controls the 
resolution at which data are clustered. Hierarchical clustering is obtained repeating 
this procedure for an increasing sequence of ^-values: each clustering level is extracted 
as a partition of data with a finite stability region in the 9 parameter. The algorithm 
is very fast, also for huge data set requiring much computer memory, due to limit in 
the couplings that renders the interaction local. The two task, coupled maps evolution 
and hierarchical cluster tree building can be separated for computational needs. With 
regard to the choice of a we note that for small values the lattice is very fragmented, 
while for large values the dependencies of the coupling (j2j) from the distance becomes 
flat and the algorithm is unable to reveal substructures in the data set. Analyzing data 
sets for which the expected classification is known, we got good results for a between 
0.10 and 0.30, the algorithm being quite insensitive in this range. 

In precedent applications of CMC algorithm we considered as optimal partition 
of the given data set the most stable solution, i.e. the one ranging on the widest 
9 interval. However the capacity of modularity to reveal the underlying structure in 
networks suggests to extend use of this quantity also for data mining. There are two 
ways to introduce modularity in our algorithm, depending on the definition of edge 
between to elements of the data set. The first way is to replace the adjacency matrix 
element Aij, giving the link between site i and site j, with the value of the coupling 
Jij between the two maps Xi and Xj. The matrix J is also a sparse matrix, but differs 
from matrix A because its elements have real values. One can think of the data set in 
the feature space as a network with weighted edges. Another way consists of the use 
of the mutual information matrix I measuring the synchronization between the maps. 
This second method is more algorithm-depending, but has the practical advantage that 
computer memory can be freed deallocating the coupling matrix in the final stage. We 
applied both methods on many data sets that are currently used to test clustering 
algorithms. In every case we found that the maximum of modularity corresponded to 
the highest value of efficiency (percentage of correctly classified points). Modularity was 
calculated following both methods, coupling and mutual information, and we did not 
find significative differences between them. 

As an example we report here results from the application of CMC algorithm 
(with Euclidean distance) to the famous IRIS data sample of Anderson [13]. This four 
dimensional data set, which is represented in the two first principal component plane in 
fig.fCQ), has often been used as a standard for testing clustering algorithms: it consists 
of three clusters (Virginica, Versicolor and Setosa) and there are 50 objects in R 4 per 
cluster. Two clusters (Virginica, Versicolor) are very overlapping. Fig. (j2J) shows, as 
a function of 9, the following quantities (from top to bottom): the number of clusters, 
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Figure 1. The Iris data set projection in the first two principal components plane 

the efficiency, the modularity calculated using the couplings matrix J, the modularity 
calculated using the mutual information matrix /. (Points are not equidistant because 
we add a new point only when the number of clusters changes.) We point out the 
similarity between the behavior of the efficiency, that measures the goodness of the 
classification, and each of the two modularity's, in particular with respect to their 
maxima position. For 9 = 0.423 the two modularity's get their maximum value together 
with the maximum value of the efficiency, 82%. 

We also applied to the same data set the SSC method [2]. Results from this test 
are shown in figure [3l to explain them we shortly describe this algorithm. SSC is 
an extension of Superparamagnetic Clustering of Domany and coworkers [3], in which 
every point i of the data set to be analyzed is associated with a Potts spin variable 
Si — » {1,2, ...,q}, where q is of the order of tenths. Each spin is coupled to its k 
nearest neighbors with a coupling which expression is similar to [21 The system evolves 
to equilibrium configurations trough a canonical Monte Carlo procedure at temperature 
T and two points are considered to belong to the same cluster if their pair correlation 
function is larger than a threshold value. Depending from the temperature T, the 
system can be found in three phases: a ferromagnetic phase from T = to T = Tf erro , 
a paramagnetic phase for T greater than a value T = T max and a superparamagnetic 
phase at intermediate T. In the superparamagnetic phase only closest spins are strongly 
correlated, so that this phase is characterized by the formation of clusters. Two spins i 
and j are attributed to the same cluster if the pair correlation G{j = (8 8i , 8 j) exceeds a 
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Figure 2. Efficiency and modularity for the Iris data sample 



fixed threshold 0. The number of these clusters increases with T from Tf erro to T max , 
resulting in a classification hierarchy, so that temperature acts as the parameter that 
controls the resolution at which data are clustered. This method has some difficulty to 
deal with data sets that contain clusters with different densities, because these clusters 
do not appear at the same resolution (temperature) level. Sequential Supermagnetic 
Clustering tries to solve this problem through a sequential procedure in which at each 
step the most stable cluster is extracted from the data set. Then the most stable cluster 
and the residual set are clustered separately. This procedure is applied repeatedly 
resulting in a binary tree. Cluster stability here is temperature stability, i.e. 



s T 



T 



(4) 



as stability parameter is used, where T d is the temperature range over which the cluster 
emerges. At each level of the procedure the branching process is stopped if clustering 
produces subcomponents none of which has stability greater than a threshold value sq. 
Tferro, the temperature at which the cluster separates, should not be very small for a 
cluster to be considered as a natural cluster. This is true also for the final branches: 
in this case if Tf erro takes a vanishing value the points of the cluster are considered as 
noise. 

When applying SSC algorithm to the Iris data set we used for the number of Potts 
states the value q = 20, for the number of nearest neighbors the value k — 10 and for the 
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T max = °- 365 
T d = 0.078 

Tferro = 



cluster 1 

N = 50 
Sj = 0.214 
T max -0-380 
Tferro = 0-024 



cluster 2 
N = 13 

Sf = 0.194 
T max - 0-283 
Tferro = 0.166 



iris \ cluster 1 
N = 100 

T max = 0- 356 
T d = 0.069 

Tferro -0.014 



residual 

N = 87 
T max -0.362 
T d = 0.036 
Tferro = 0.015 



Figure 3. Cluster structure resulting from the application of SSC algorithm [2] to the 
Iris data set. 



stability threshold the value s@ = 0.1 (but results are stable in the range 0.1-=-0.194). As 
we see in figure EJ at the first step the method succeeds in separating the Setosa cluster 
from the Virginica and Versicolor points, but it is unable to separate these two species 
efficiently. Lowering the stability threshold under 0.099, the 87-points cluster splits in 
two clusters of 81 and 6 points, so that results get worse. It seems that SSC solves 
very smartly the problem of identifying homogeneous classes, but it meets difficulties 
when applied to recognize classes which present inhomogeneities inside, even though 
composed of elements that are close in the features space. 

Now we go back to the modularity approach and present another example given by 
the data set used in [TT] , that is also characterized by clusters of different size and density. 
It is a real data-set extracted from a LANDSAT Thematic Mapper (TM) image of an 
area in the Southern of Italy consisting of 1489 pixels each of which is represented by 
six spectral bands. The ground truth was determined by means of visual interpretation 
of areal photos followed by site visits. The area study includes seven landcover classes: 
(A) Coniferous reafforestation, 69 points; (B) Bare soil, 85 points; (C) Urban areas, 91 
points; (D) Vineyards, 300 points; (E) Cropland, 316 points; (F) Pasture, 265 points; 
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Figure 4. The LANDSAT TM image data set projection in the first two principal 
components plane. Colors distinguish among different landcover classes. 



(G) Olives groves, 363 points. As reported in [TT] our algorithm succeeded in resolving 
the data-structure identifying seven cluster that where stable on a large range of 8; 
96.6% of data was classified; the purity of the classification (percentage of correctly 
classified points) is 96.2% so that the efficiency was 92.9%. We remember also that we 
searched for the best classification requiring that it would be the one surviving for the 
largest range in 8. We found that both the six clusters partition and the seven clusters 
one appeared stable: prior knowledge was needed, in this case, to select the correct 
partition of the data set. We reanalyzed this data set according to the criterion of 
maximum modularity. The results from these analysis are essentially the same as those 
reported previously: as we see in fig (jSJ) modularity reached its maximum value of 0.71 
at 8 = 0.35 in presence of 7 big clusters and 19 small clusters of few points, the efficiency 
being 93.49. This slight improvement in itself does not justify the introduction of the 
modularity criterion; however we note that, for this data set, the previous analysis had 
such a good performance that it was not possible to improve much further. Finally, we 
emphasize that the problem of choose choosing between the two stable solutions found 
an optimal solution without an external supervision. 

4. Conclusions 

We have shown that the problem of finding the optimal classification in hierarchical 
clustering can be turned into the problem of finding communities in a weighted network. 
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Figure 5. Efficiency and modularity for the LANDSAT data sample 



Results reported in this article are obtained using the CMC algorithm, but it should be 
clear that the modularity method can be applied to any hierarchical clustering algorithm. 
For example it can be applied also to the Super-paramagnetic clustering (SPC) approach. 
Here the role of 8 is played by temperature, but couplings between the elements of the 
data set have the same definition. We stress two important characteristic of the method 
here proposed. The first one is that no new parameter is introduced. The second feature 
is that it reduces considerably the computational time with respect to the alternative 
method described in [2], where one needs many Monte Carlo runs to find just one cluster. 
We remember also that, in that method, once one cluster is identified, it is eliminated 
from the sample and this could result in the loss of the collective aspect of the data 
distribution. 

We are aware that the problem of finding the most natural clusters has not a unique 
solution, but it seems to us that the modularity maximum method corresponds to an 
intuitive idea of optimal classification as the one in which connectivity within groups is 
as far as possible the one obtained from a random distribution. 
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